Veri kümemiz temelde, ABD’nin çeşitli bölgelerindeki 1338 bireyden elde edilmiş bazı kişisel bilgileri sunar. Bu bağlamda 1338 gözlem ve 7 değişkenden oluşuyor. İlk olarak, değişkenlerimizi tanıyalım:
age: Birincil yararlanıcının yaşı.
sex: Sigorta müteahhidinin cinsiyeti.
bmi: Vücut kitle indeksi. Bu değişken, yetişkin bir insanın kilosunun boyuna göre normal olup olmadığını gösteren bir parametredir. Boy ağırlık oranını kullanarak vücut ağırlığının objektif indeksini (kg / m ^ 2) hesaplar. İdeal vücut kitle indeksi 18,5 - 24,9 aralığındadır.
children: Sağlık sigortası kapsamındaki sahip olunan çocuk sayısı.
smoker: Sigara kullananlar.
region: Yararlanıcının ABD’deki yerleşim bölgesi (kuzeydoğu, güneydoğu, güneybatı, kuzeybatı).
charges: Sağlık sigortası tarafından faturalanan bireysel tıbbi masraflar.
Bu verideki amacımız Amerikanın çeşitli bölgelerinde yaşayan bir takım insanların yaş, cinsiyet, BMI (vücut kitle indeksi), çocuk sayıları, sigara içme durumları, bölgeleri ve tıbbi masraflarını göz önüne alarak bireysel tıbbi sigorta maliyetlerini tahmin etmektir.
Tıbbi masraflar, yaralanma veya hastalığın önlenmesi veya tedavisi için yapılan masraflardır. Kar elde etmek için sigorta şirketleri, sigortalıya ödenen miktardan daha yüksek prim toplamalıdır. Bu nedenle, sigorta şirketleri bireysel tıbbi maliyetleri doğru bir şekilde tahmin etmek istiyor.
Sağlık sigortası, bir kişinin veya kişilerin sağlık harcamalarını finanse etmek için bir araçtır. ABD’de insanların çoğunluğu, genellikle mevcut bir işveren aracılığıyla alınan özel sağlık sigortasına sahiptir ve azınlık, devlet destekli programlar tarafından kapsanmaktadır.
Verimizde yaptığımız çeşitli analizler sonucunda sağlık sigortası yaptırmak isteyen kişinin sigorta maliyetlerini tahmin etmiş olacağız. Yapılan tahminlere göre sigorta şirketleri sigorta yaptırmak isteyen kişiden kar edebilmek amacıyla bir fiyatlandırma yapacaktır.
Veriler kaynak kitabına göre, ABD Sayım Bürosu’ndan alınan demografik istatistikler temelinde simüle edilmiştir.
Verideki gözlem sayımız 1338 olduğundan bu veriye ilişkin analizlerimize örneklem seçmeden devam edeceğiz.
library(readr)
library(dplyr)
library(VIM)
library(ISLR)
library(funModeling)
library(ggplot2)
library(moments)
library(funModeling)
data <- read.csv("insurance.csv")
data <- as.data.frame(data)
head(data)
## age sex bmi children smoker region charges
## 1 19 female 27.900 0 yes southwest 16884.924
## 2 18 male 33.770 1 no southeast 1725.552
## 3 28 male 33.000 3 no southeast 4449.462
## 4 33 male 22.705 0 no northwest 21984.471
## 5 32 male 28.880 0 no northwest 3866.855
## 6 31 female 25.740 0 no southeast 3756.622
str(data)
## 'data.frame': 1338 obs. of 7 variables:
## $ age : int 19 18 28 33 32 31 46 37 37 60 ...
## $ sex : chr "female" "male" "male" "male" ...
## $ bmi : num 27.9 33.8 33 22.7 28.9 ...
## $ children: int 0 1 3 0 0 0 1 3 2 0 ...
## $ smoker : chr "yes" "no" "no" "no" ...
## $ region : chr "southwest" "southeast" "southeast" "northwest" ...
## $ charges : num 16885 1726 4449 21984 3867 ...
summary(data)
## age sex bmi children
## Min. :18.00 Length:1338 Min. :15.96 Min. :0.000
## 1st Qu.:27.00 Class :character 1st Qu.:26.30 1st Qu.:0.000
## Median :39.00 Mode :character Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## smoker region charges
## Length:1338 Length:1338 Min. : 1122
## Class :character Class :character 1st Qu.: 4740
## Mode :character Mode :character Median : 9382
## Mean :13270
## 3rd Qu.:16640
## Max. :63770
data$smoker <- factor(data$smoker, levels=c("yes","no"))
data$sex <- factor(data$sex, levels=c("female","male"))
data$region <- factor(data$region, levels=c("southeast", "southwest", "northeast", "northwest"))
data[which(is.na(data)),]
## [1] age sex bmi children smoker region charges
## <0 rows> (or 0-length row.names)
colSums(is.na(data))
## age sex bmi children smoker region charges
## 0 0 0 0 0 0 0
data_org <- data
data_miss<-data
aa<-sample(1:nrow(data_miss),floor(nrow(data_miss)*0.05))
data_miss$age[aa]<-NA
head(data_miss[which(is.na(data_miss)),])
## age sex bmi children smoker region charges
## 11 NA male 26.220 0 no northeast 2721.321
## 42 NA female 36.630 2 no southeast 4949.759
## 48 NA female 34.770 0 no northwest 3556.922
## 112 NA female 29.700 2 no southwest 11881.358
## 115 NA male 32.205 3 no northeast 11488.317
## 124 NA male 31.350 1 yes northeast 39556.495
aggr(data_miss,col=c("lightgreen","pink"), numbers=TRUE, sortVars=TRUE, labels=names(data_miss),cex.axis=.7,gap=3,ylab=c("Missing Ratio","Missing Pattern"))
##
## Variables sorted by number of missings:
## Variable Count
## age 0.04932735
## sex 0.00000000
## bmi 0.00000000
## children 0.00000000
## smoker 0.00000000
## region 0.00000000
## charges 0.00000000
df_status(data_miss)
## variable q_zeros p_zeros q_na p_na q_inf p_inf type unique
## 1 age 0 0.0 66 4.93 0 0 integer 47
## 2 sex 0 0.0 0 0.00 0 0 factor 2
## 3 bmi 0 0.0 0 0.00 0 0 numeric 548
## 4 children 574 42.9 0 0.00 0 0 integer 6
## 5 smoker 0 0.0 0 0.00 0 0 factor 2
## 6 region 0 0.0 0 0.00 0 0 factor 4
## 7 charges 0 0.0 0 0.00 0 0 numeric 1337
library(DMwR2)
data_knn <- data_miss
knn_imp <- knnImputation(data_knn, k=5, meth="median")
anyNA(knn_imp)
## [1] FALSE
a <- which(is.na(data_knn$age))
a_knn <- knn_imp$age[a]
a_knn
## [1] 21 35 24 46 36 40 24 20 39 35 49 31 52 47 42 51 40 61 54 32 58 33 30 33 49
## [26] 43 55 34 20 44 43 36 33 32 58 31 53 53 33 22 35 21 45 61 27 40 43 27 51 44
## [51] 55 39 37 43 39 41 30 22 33 52 31 29 20 24 43 34
data_org$age[a]
## [1] 25 31 28 55 52 44 18 18 35 34 55 30 55 25 47 50 45 57 46 35 56 27 48 47 52
## [26] 36 19 18 27 51 52 48 18 38 56 28 51 35 39 24 23 19 35 59 21 61 49 48 57 37
## [51] 49 26 42 39 50 50 43 20 41 18 36 20 18 19 31 52
summary(a_knn)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 20.00 31.00 38.00 38.45 45.75 61.00
summary(data_org$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 27.00 39.00 39.21 51.00 64.00
glimpse(data)
## Rows: 1,338
## Columns: 7
## $ age <int> 19, 18, 28, 33, 32, 31, 46, 37, 37, 60, 25, 62, 23, 56, 27, 1~
## $ sex <fct> female, male, male, male, male, female, female, female, male,~
## $ bmi <dbl> 27.900, 33.770, 33.000, 22.705, 28.880, 25.740, 33.440, 27.74~
## $ children <int> 0, 1, 3, 0, 0, 0, 1, 3, 2, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0~
## $ smoker <fct> yes, no, no, no, no, no, no, no, no, no, no, yes, no, no, yes~
## $ region <fct> southwest, southeast, southeast, northwest, northwest, southe~
## $ charges <dbl> 16884.924, 1725.552, 4449.462, 21984.471, 3866.855, 3756.622,~
summary(data)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 yes: 274
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 no :1064
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## southeast:364 Min. : 1122
## southwest:325 1st Qu.: 4740
## northeast:324 Median : 9382
## northwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
ggplot(data, aes(y=bmi))+
geom_boxplot()
ggplot(data, aes(y=charges))+
geom_boxplot()
ggplot(data,aes(charges))+
geom_histogram( fill="gold",colour="black", alpha=0.7)
altsinir_bmi <- median(data$bmi) - 3 * mad(data$bmi, constant = 1)
altsinir_bmi
## [1] 17.86
ustsinir_bmi <- median(data$bmi) + 3 * mad(data$bmi, constant = 1)
ustsinir_bmi
## [1] 42.94
outlier_bmi <- which(data$bmi < altsinir_bmi | data$bmi > ustsinir_bmi)
outlier_bmi
## [1] 29 117 129 173 233 251 287 293 357 384 402 411 413 429 439
## [16] 443 455 494 522 544 548 550 564 573 583 661 675 681 702 797
## [31] 822 848 861 868 896 931 942 1025 1030 1048 1089 1132 1157 1227 1287
## [46] 1318 1333
summary(data)
## age sex bmi children smoker
## Min. :18.00 female:662 Min. :15.96 Min. :0.000 yes: 274
## 1st Qu.:27.00 male :676 1st Qu.:26.30 1st Qu.:0.000 no :1064
## Median :39.00 Median :30.40 Median :1.000
## Mean :39.21 Mean :30.66 Mean :1.095
## 3rd Qu.:51.00 3rd Qu.:34.69 3rd Qu.:2.000
## Max. :64.00 Max. :53.13 Max. :5.000
## region charges
## southeast:364 Min. : 1122
## southwest:325 1st Qu.: 4740
## northeast:324 Median : 9382
## northwest:325 Mean :13270
## 3rd Qu.:16640
## Max. :63770
set.seed(7357)
trainIndex <- sample(1:nrow(data), size = round(0.8*nrow(data)), replace=FALSE)
train<- data[trainIndex ,]
test <- data[-trainIndex ,]
library("openxlsx")
#write.xlsx(train, 'train.xlsx')
#write.xlsx(test, 'test.xlsx')
glimpse(train)
## Rows: 1,070
## Columns: 7
## $ age <int> 56, 34, 35, 40, 26, 45, 40, 56, 45, 38, 45, 64, 52, 18, 22, 5~
## $ sex <fct> female, female, female, female, male, female, female, female,~
## $ bmi <dbl> 41.910, 27.500, 27.700, 28.120, 27.060, 35.300, 28.690, 26.60~
## $ children <int> 0, 1, 3, 1, 0, 0, 3, 1, 0, 1, 1, 0, 0, 0, 3, 1, 3, 1, 2, 0, 1~
## $ smoker <fct> no, no, no, yes, yes, no, no, no, no, no, no, no, no, no, no,~
## $ region <fct> southeast, southwest, southwest, northeast, southeast, southw~
## $ charges <dbl> 11093.623, 5003.853, 6414.178, 22331.567, 17043.341, 7348.142~
train$age_cat[train$age <= 35] <- "Young Adult"
train$age_cat[train$age >= 36 & train$age <= 55] <- "Senior"
train$age_cat[train$age >= 56] <- "Elder"
train$age_cat <- as.factor(train$age_cat)
train$weight_condition[train$bmi<18.5] <- "Under Weight"
train$weight_condition[train$bmi>=18.5 & train$bmi < 24.9] <- "Normal Weight"
train$weight_condition[train$bmi >= 24.9 & train$bmi < 29.9] <- "Overweight"
train$weight_condition[train$bmi >= 29.9] <- "Obese"
train$weight_condition <- as.factor(train$weight_condition)
avg_charge <- mean(train$charges, trim=0.1)
avg_charge
## [1] 11304.3
train$charge_status[train$charges < avg_charge] <- "Below Average"
train$charge_status[train$charges >= avg_charge] <- "Above Average"
train$charge_status <- as.factor(train$charge_status)
train$child_status[train$children <= 0] <- "Cocugu yok"
train$child_status[train$children > 0] <- "Cocugu var"
train$child_status <- as.factor(train$child_status)
nrow(train)
## [1] 1070
ncol(train)
## [1] 11
summary(train)
## age sex bmi children smoker
## Min. :18.00 female:533 Min. :16.82 Min. :0.000 yes:228
## 1st Qu.:27.00 male :537 1st Qu.:26.32 1st Qu.:0.000 no :842
## Median :40.00 Median :30.27 Median :1.000
## Mean :39.44 Mean :30.62 Mean :1.118
## 3rd Qu.:52.00 3rd Qu.:34.43 3rd Qu.:2.000
## Max. :64.00 Max. :52.58 Max. :5.000
## region charges age_cat weight_condition
## southeast:291 Min. : 1132 Elder :178 Normal Weight:181
## southwest:262 1st Qu.: 4938 Senior :438 Obese :573
## northeast:261 Median : 9575 Young Adult:454 Overweight :301
## northwest:256 Mean :13478 Under Weight : 15
## 3rd Qu.:17004
## Max. :63770
## charge_status child_status
## Above Average:444 Cocugu var:624
## Below Average:626 Cocugu yok:446
##
##
##
##
profiling_num(train)
## variable mean std_dev variation_coef p_01 p_05
## 1 age 39.441121 14.115778 0.3578949 18.0000 18.0000
## 2 bmi 30.620790 6.091033 0.1989182 18.2225 21.1035
## 3 children 1.117757 1.206919 1.0797687 0.0000 0.0000
## 4 charges 13477.862582 12140.470777 0.9007712 1253.4620 1825.4363
## p_25 p_50 p_75 p_95 p_99 skewness kurtosis iqr
## 1 27.000 40.000 52.00 62.0000 64.00 0.03528373 1.746909 25.000
## 2 26.315 30.275 34.43 41.1895 46.53 0.30719200 2.949372 8.115
## 3 0.000 1.000 2.00 3.0000 5.00 0.91358041 3.186884 2.000
## 4 4938.468 9575.442 17003.74 41935.6779 48674.17 1.48736305 4.494524 12065.269
## range_98 range_80
## 1 [18, 64] [19, 59]
## 2 [18.2225, 46.53] [22.895, 38.411]
## 3 [0, 5] [0, 3]
## 4 [1253.46201, 48674.166059] [2455.295555, 34832.758382]
plot_num(train)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
hist(train$charges, col = "lightpink")
freq(train)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## sex frequency percentage cumulative_perc
## 1 male 537 50.19 50.19
## 2 female 533 49.81 100.00
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## smoker frequency percentage cumulative_perc
## 1 no 842 78.69 78.69
## 2 yes 228 21.31 100.00
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## region frequency percentage cumulative_perc
## 1 southeast 291 27.20 27.20
## 2 southwest 262 24.49 51.69
## 3 northeast 261 24.39 76.08
## 4 northwest 256 23.93 100.00
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## age_cat frequency percentage cumulative_perc
## 1 Young Adult 454 42.43 42.43
## 2 Senior 438 40.93 83.36
## 3 Elder 178 16.64 100.00
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## weight_condition frequency percentage cumulative_perc
## 1 Obese 573 53.55 53.55
## 2 Overweight 301 28.13 81.68
## 3 Normal Weight 181 16.92 98.60
## 4 Under Weight 15 1.40 100.00
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## charge_status frequency percentage cumulative_perc
## 1 Below Average 626 58.5 58.5
## 2 Above Average 444 41.5 100.0
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
## child_status frequency percentage cumulative_perc
## 1 Cocugu var 624 58.32 58.32
## 2 Cocugu yok 446 41.68 100.00
## [1] "Variables processed: sex, smoker, region, age_cat, weight_condition, charge_status, child_status"
ggplot(train, aes(x=region,y=charges, fill=region))+
geom_boxplot()+
labs(title="Bölgeler İçin Masraf Kutu Çizimi",
x="Bölge", y = "Masraf")+
scale_fill_discrete(name = "Bölge")+
stat_summary(fun = median, geom="line", group= 1, color= "black", size = 1)
ggplot(train, aes(x=smoker, y=charges, fill=smoker))+
geom_boxplot()+
labs(title="Sigara İçenler İçin Masraf Kutu Çizimi",
x="Sigara İçenler", y = "Masraf")+
scale_fill_discrete(name = "Sigara İçenler")+
stat_summary(fun = median, geom="line", group= 1, color= "black", size = 1)
ggplot(train, aes(x=weight_condition, y=charges, fill=weight_condition))+
geom_boxplot()+
labs(title="Ağırlık Durumu İçin Masraf Kutu Çizimi",
x="Ağırlık Durumu", y = "Masraf")+
scale_fill_discrete(name = "Ağırlık Durumu")+
stat_summary(fun = median, geom="line", group= 1, color= "black", size = 1)
ggplot(train, aes(x=sex, y=charges, fill=sex))+
geom_boxplot()+
labs(title="Cinsiyet İçin Masraf Kutu Çizimi",
x="Cinsiyet", y = "Masraf")+
scale_fill_discrete(name = "Cinsiyet")+
stat_summary(fun = median, geom="line", group= 1, color= "black", size = 1)
ggplot(train, aes(x=age_cat, y=charges, fill=age_cat))+
geom_boxplot()+
labs(title="Yaş Kategorileri İçin Masraf Kutu Çizimi",
x="Yaş Kategorileri", y = "Masraf")+
scale_fill_discrete(name = "Yaş Kategorileri")+
stat_summary(fun = median, geom="line", group= 1, color= "black", size = 1)
ggplot(data = train, aes(x = charges, fill = smoker)) +
geom_density(alpha = 0.5) +
ggtitle("Masraf'ın Sigara Durumuna Göre Dağılımı")
ggplot(train, aes(age,bmi, color=sex, shape=sex))+
geom_point(size=3,alpha=0.6)
ggplot(train, aes(charges,age))+
geom_point(size=2,shape=21,stroke=1,color="dodgerblue1", fill="white")+
geom_smooth(method = "lm", col="darkblue",se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
library(plotly)
## Warning: package 'plotly' was built under R version 4.1.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
d_plot <- ggplot(train, aes(bmi, charges, fill=sex, shape=sex)) +
geom_point(position = position_jitter(width= 0.2, height = 0), size = 2)
ggplotly(d_plot)
library(ggplot2)
ggplot(train, aes(bmi,charges, color=age, size=age))+
geom_point(alpha=0.5, stroke=2)+
scale_size(range = c(1, 8))+
scale_color_gradient(low = "blue", high = "lightpink")
library(ggplot2)
ggplot(train, aes(bmi,charges, color=children, size=children))+
geom_point(alpha=0.5, stroke=2)+
scale_size(range = c(1, 8))+
scale_color_gradient(low = "blue", high = "lightpink")
table <- xtabs(~sex+age_cat+smoker, data=train)
ftable(table)
## smoker yes no
## sex age_cat
## female Elder 14 80
## Senior 39 179
## Young Adult 44 177
## male Elder 19 65
## Senior 51 169
## Young Adult 61 172
library(ggmosaic)
## Warning: package 'ggmosaic' was built under R version 4.1.3
ggplot(train) +
geom_mosaic(aes(x = product(sex, smoker), fill=sex)) +
labs(x = "Sigara ", title='f(Yas Kategorileri, Sigara| Cinsiyet)') +
facet_grid(age_cat~.)
## Warning: `unite_()` was deprecated in tidyr 1.2.0.
## Please use `unite()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
library(aplpack)
new_data<-train%>%
group_by(age_cat) %>%
dplyr::summarize(mean_charges = mean(charges),mean_bmi = mean(bmi),mean_age = mean(age))
faces(new_data[,-1], labels=as.character(new_data$age_cat))
## effect of variables:
## modified item Var
## "height of face " "mean_charges"
## "width of face " "mean_bmi"
## "structure of face" "mean_age"
## "height of mouth " "mean_charges"
## "width of mouth " "mean_bmi"
## "smiling " "mean_age"
## "height of eyes " "mean_charges"
## "width of eyes " "mean_bmi"
## "height of hair " "mean_age"
## "width of hair " "mean_charges"
## "style of hair " "mean_bmi"
## "height of nose " "mean_age"
## "width of nose " "mean_charges"
## "width of ear " "mean_bmi"
## "height of ear " "mean_age"
n<-nrow(train)
train_sorted <- train[order(train$charges),]
a<-(n/2)
b<-(n/2)+1
(train_sorted$charges[a]+train_sorted$charges[b])/2
## [1] 9575.442
median(train$charges)
## [1] 9575.442
mean(train$charges)
## [1] 13477.86
hist(train$charges)
stdev<-sd(train$charges)
mean<-mean(train$charges)
Degisim_kats_charges<-(stdev/mean)*100
quantile(train$charges) # Çeyrek değerler
## 0% 25% 50% 75% 100%
## 1131.507 4938.468 9575.442 17003.737 63770.428
quantile(train$charges, c(.1,.9)) # Customized quantiles
## 10% 90%
## 2455.296 34832.758
q1<-as.vector(quantile(train$charges,0.25))
q3<-as.vector(quantile(train$charges,0.75))
DAG<-q3-q1
DAG
## [1] 12065.27
genislik<-max(train$charges)-min(train$charges)
genislik
## [1] 62638.92
sort <- train[order(train$charges),]
medianf<-median(sort$charges)
sort$fmed<-abs(sort$charges-medianf)
sort2 <- sort[order(sort$fmed),]
mad<-median(sort2$fmed)
olduğu anlaşılır.
library(GGally)
## Warning: package 'GGally' was built under R version 4.1.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:ggmosaic':
##
## happy
## The following object is masked from 'package:funModeling':
##
## range01
cor_train<-train[, c(1,3,7)]
library(GGally)
cor(cor_train)
## age bmi charges
## age 1.0000000 0.1103927 0.2801000
## bmi 0.1103927 1.0000000 0.2170559
## charges 0.2801000 0.2170559 1.0000000
plot(cor_train)
ggpairs(cor_train)
library(PerformanceAnalytics)
## Warning: package 'PerformanceAnalytics' was built under R version 4.1.3
## Zorunlu paket yükleniyor: xts
## Warning: package 'xts' was built under R version 4.1.3
## Zorunlu paket yükleniyor: zoo
## Warning: package 'zoo' was built under R version 4.1.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following objects are masked from 'package:moments':
##
## kurtosis, skewness
## The following object is masked from 'package:graphics':
##
## legend
chart.Correlation(cor_train, histogram=TRUE, pch=19)
ggplot(train, aes(x=age_cat,y=charges, fill=age_cat))+
geom_boxplot()+
stat_summary(fun = median, geom="line", group= 1, color= "black", size = 1)
etk_train<-train%>%
group_by(sex,age_cat)%>%
summarise(Median=median(charges))
## `summarise()` has grouped output by 'sex'. You can override using the `.groups`
## argument.
etk_train
## # A tibble: 6 x 3
## # Groups: sex [2]
## sex age_cat Median
## <fct> <fct> <dbl>
## 1 female Elder 13740.
## 2 female Senior 9828.
## 3 female Young Adult 4351.
## 4 male Elder 13121.
## 5 male Senior 9303.
## 6 male Young Adult 4438.
ggplot(etk_train, aes(x = age_cat, y = Median,color=sex,group=sex)) +
geom_line() +
geom_point()
ggplot(train,aes(charges))+
geom_histogram(aes(y=..density..), fill="white", color="black")+
geom_density(alpha=.4,fill="blue")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
library(ryouready)
## Warning: package 'ryouready' was built under R version 4.1.3
qqcharges <- qqnorm_spss(train$charges)
ggplot(qqcharges)+
labs(title="Normal Q-Q Çizimi")+
theme(axis.title.x = element_text(color="black", face="bold", size=12),
axis.title.y = element_text(color="black", face="bold",size=12),
plot.title = element_text(hjust = 0.5,color="black", face="bold", size=14))
skewness(train$charges)
## [1] 1.487363
kurtosis(train$charges)
## [1] 1.494524
profiling_num(train$charges)
## variable mean std_dev variation_coef p_01 p_05 p_25 p_50
## 1 var 13477.86 12140.47 0.9007712 1253.462 1825.436 4938.468 9575.442
## p_75 p_95 p_99 skewness kurtosis iqr
## 1 17003.74 41935.68 48674.17 1.487363 4.494524 12065.27
## range_98 range_80
## 1 [1253.46201, 48674.166059] [2455.295555, 34832.758382]
dt1 <- table(train$charge_status,train$smoker)
prop.table(dt1,2)
##
## yes no
## Above Average 1.0000000 0.2565321
## Below Average 0.0000000 0.7434679
round(100*prop.table(dt1,2),2)
##
## yes no
## Above Average 100.00 25.65
## Below Average 0.00 74.35
library(DescTools)
## Warning: package 'DescTools' was built under R version 4.1.3
## Registered S3 method overwritten by 'DescTools':
## method from
## plot.bagplot aplpack
##
## Attaching package: 'DescTools'
## The following object is masked from 'package:aplpack':
##
## plot.bagplot
## The following objects are masked from 'package:Hmisc':
##
## %nin%, Label, Mean, Quantile
Assocs(dt1)[1:3,1]
## Contingency Coeff. Cramer V Kendall Tau-b
## 0.5256389 0.6178842 0.6178842
OR1 <- OddsRatio(dt1, conf.level=0.95)
OR1
## odds ratio lwr.ci upr.ci
## 1322.45035 82.11365 21298.22378
dt2 <- table(train$age_cat,train$smoker)
round(100*prop.table(dt2,2),2)
##
## yes no
## Elder 14.47 17.22
## Senior 39.47 41.33
## Young Adult 46.05 41.45
Assocs(dt2)[1:3,1]
## Contingency Coeff. Cramer V Kendall Tau-b
## 0.04167082 0.04170705 -0.03970646
dt3 <- table(train$charge_status,train$region)
round(100*prop.table(dt3,2),2)
##
## southeast southwest northeast northwest
## Above Average 43.64 37.02 44.44 40.62
## Below Average 56.36 62.98 55.56 59.38
dt4<- table(train$charge_status,train$child_status)
round(100*prop.table(dt4,2),2)
##
## Cocugu var Cocugu yok
## Above Average 37.50 47.09
## Below Average 62.50 52.91
library(inspectdf)
## Warning: package 'inspectdf' was built under R version 4.1.3
train %>% inspect_types()
## # A tibble: 3 x 4
## type cnt pcnt col_name
## <chr> <int> <dbl> <named list>
## 1 factor 7 63.6 <chr [7]>
## 2 integer 2 18.2 <chr [2]>
## 3 numeric 2 18.2 <chr [2]>
tra_cat<-train %>% inspect_cat()
tra_cat$levels$hastalik
## NULL
tra_cat %>% show_plot()
plot_num(train)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:plotly':
##
## select
## The following object is masked from 'package:dplyr':
##
## select
Box_charges<- boxcox(train$charges ~ 1,
lambda = seq(-6,6,0.1)) # Try values -6 to 6 by 0.1
Cox_charges<- data.frame(Box_charges$x, Box_charges$y)
Cox_charges <- Cox_charges[order(-Cox_charges$Box_charges.y),]
Cox_charges[1,]
## Box_charges.x Box_charges.y
## 62 0.1 -3631.633
lambda <- Cox_charges[1, "Box_charges.x"]
lambda
## [1] 0.1
min(train$charges)
## [1] 1131.507
train$charges_kok<-sqrt(train$charges)
hist(train$charges_kok)
train$charges_log <- log10(train$charges)
hist(train$charges_log)
ggplot(train, aes(bmi,charges))+
geom_point()+
geom_smooth(method = "loess", col="red",se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
g<-ggplot(train, aes(bmi,charges_log,label=rownames(train)))+
geom_point(size=0.90)
g+geom_text(label=rownames(train),nudge_x=0.20,check_overlap=T,size=2.5)+
geom_smooth(method="loess",col="red",se=FALSE)
## `geom_smooth()` using formula 'y ~ x'
#Bmi ve Charges arasındaki, düzleştirme doğrusu düze yaklaştı.
plot_num(train)
## Warning: `guides(<scale> = FALSE)` is deprecated. Please use `guides(<scale> =
## "none")` instead.
hist(train$age)
train$age_log <- log10(train$age)
hist(train$age_log)
train$age_kok <- sqrt(train$age)
hist(train$age_kok)
ggplot(train, aes(age,charges))+
geom_point(size=1)+
geom_text(label=rownames(train),nudge_x=0.04,check_overlap=T,size=2.5)+
geom_smooth(method = "loess", col="darkblue",se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
ggplot(train, aes(age,charges_log))+
geom_point(size=1)+
geom_text(label=rownames(train),nudge_x=0.04,check_overlap=T,size=2.5)+
geom_smooth(method = "loess", col="darkblue",se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
mean_age<-mean(train$age)
train$age_merk<-(train$age-mean_age)
ggplot(train, aes(x = age_merk, y =charges_log )) +
stat_smooth(method = "lm", se = FALSE, color = "green", formula = y ~ x) +
stat_smooth(method = "lm", se = FALSE, color = "blue", formula = y ~ x + I(x ^ 2)) +
stat_smooth(method = "lm", se = FALSE, color = "red", formula = y ~ x + I(x ^ 2)+ I(x ^ 3)) +
geom_point(colour = "black", size = 1)
Box_bmi<- boxcox(train$bmi ~ 1,
lambda = seq(-6,6,0.1)) # Try values -6 to 6 by 0.1
Cox_bmi<- data.frame(Box_bmi$x, Box_bmi$y)
Cox_bmi <- Cox_bmi[order(-Cox_bmi$Box_bmi.y),]
Cox_bmi[1,]
## Box_bmi.x Box_bmi.y
## 65 0.4 -2014.825
lambda <- Cox_bmi[1, "Box_bmi.x"]
lambda
## [1] 0.4
library(rcompanion)
## Warning: package 'rcompanion' was built under R version 4.1.3
bmi_tukey<- transformTukey(train$bmi, plotit=FALSE)
##
## lambda W Shapiro.p.value
## 417 0.4 0.9983 0.3597
##
## if (lambda > 0){TRANS = x ^ lambda}
## if (lambda == 0){TRANS = log(x)}
## if (lambda < 0){TRANS = -1 * x ^ lambda}
hist(train$bmi)
train$bmi_log<-log10(train$bmi)
hist(train$bmi_log)
train$bmi_kok<-sqrt(train$bmi) #Age'de kok dönüsümü
hist(train$bmi_kok)
library(fBasics)
## Warning: package 'fBasics' was built under R version 4.1.3
## Zorunlu paket yükleniyor: timeDate
##
## Attaching package: 'timeDate'
## The following objects are masked from 'package:PerformanceAnalytics':
##
## kurtosis, skewness
## The following objects are masked from 'package:moments':
##
## kurtosis, skewness
## Zorunlu paket yükleniyor: timeSeries
## Warning: package 'timeSeries' was built under R version 4.1.3
##
## Attaching package: 'timeSeries'
## The following object is masked from 'package:zoo':
##
## time<-
shapiro.test(train$bmi_log)
##
## Shapiro-Wilk normality test
##
## data: train$bmi_log
## W = 0.99502, p-value = 0.001375
shapiro.test(train$bmi_kok)
##
## Shapiro-Wilk normality test
##
## data: train$bmi_kok
## W = 0.99822, p-value = 0.3322
jarqueberaTest(train$bmi_log)
##
## Title:
## Jarque - Bera Normalality Test
##
## Test Results:
## STATISTIC:
## X-squared: 10.6273
## P VALUE:
## Asymptotic p Value: 0.004924
##
## Description:
## Mon Aug 01 23:51:50 2022 by user: beyza
jarqueberaTest(train$bmi_kok)
##
## Title:
## Jarque - Bera Normalality Test
##
## Test Results:
## STATISTIC:
## X-squared: 2.3639
## P VALUE:
## Asymptotic p Value: 0.3067
##
## Description:
## Mon Aug 01 23:51:50 2022 by user: beyza
head(train)
## age sex bmi children smoker region charges age_cat
## 1173 56 female 41.91 0 no southeast 11093.623 Elder
## 105 34 female 27.50 1 no southwest 5003.853 Young Adult
## 786 35 female 27.70 3 no southwest 6414.178 Young Adult
## 281 40 female 28.12 1 yes northeast 22331.567 Senior
## 1275 26 male 27.06 0 yes southeast 17043.341 Young Adult
## 934 45 female 35.30 0 no southwest 7348.142 Senior
## weight_condition charge_status child_status charges_kok charges_log
## 1173 Obese Below Average Cocugu yok 105.32627 4.045073
## 105 Overweight Below Average Cocugu var 70.73792 3.699305
## 786 Overweight Below Average Cocugu var 80.08856 3.807141
## 281 Overweight Above Average Cocugu var 149.43750 4.348919
## 1275 Overweight Above Average Cocugu yok 130.55015 4.231555
## 934 Obese Below Average Cocugu yok 85.72130 3.866178
## age_log age_kok age_merk bmi_log bmi_kok
## 1173 1.748188 7.483315 16.5588785 1.622318 6.473793
## 105 1.531479 5.830952 -5.4411215 1.439333 5.244044
## 786 1.544068 5.916080 -4.4411215 1.442480 5.263079
## 281 1.602060 6.324555 0.5588785 1.449015 5.302829
## 1275 1.414973 5.099020 -13.4411215 1.432328 5.201923
## 934 1.653213 6.708204 5.5588785 1.547775 5.941380
orj<-train[,c(1,3,7)]
library(PerformanceAnalytics)
chart.Correlation(orj, histogram=TRUE, pch=19, method="kendall")
transform_train<-train[,c(1,18,13)]
chart.Correlation(transform_train, histogram=TRUE, pch=19, method="kendall")
Bu verideki amacımız Amerika’nın çeşitli bölgelerinde yaşayan bir takım insanların yaş, cinsiyet, BMI (vücut kitle indeksi), çocuk sayıları, sigara içme durumları, bölgeleri ve tıbbi masraflarını göz önüne alarak bireysel tıbbi sigorta maliyetlerini tahmin etmekti. Yaptığımız birçok analiz sonunda bazı gerçeklerle yüzleştik. Mesela sigara içenlerin her yıl tıbbi masrafları sigara içmeyenlere göre çok daha fazla. Bu durum da bize sigara içen insanların sağlık sorunlarının daha fazla olduğunu gösteriyor. Verisetindeki yaşlı insanların genç insanlara göre daha fazla olduğunu da yaptığımız analizlerde gözlemlemiş olduk. Obezite olan insanların da daha çok sağlık masrafı yaptığını inceledik. Bölgelere, cinsiyete, çocuk sayılarına göre çok bir farklılık olmasa da yaşın, sigara içme durumunun, vücut kitle endeksinin tıbbi sağlık maliyetlerinde büyük fark yarattığını, sağlığımız için yaşımızın ilerlemesine engel olamasak da vücut kitle endeksimizi normal tutup sigara kullanımını sonlandırabiliriz…
● https://kolanobezitemerkezi.com/vucut-kitle-indeksi/
● https://www.kaggle.com/datasets/mirichoi0218/insurance?datasetId=13720&sortBy=voteCount
● https://www.kaggle.com/code/tugrulyilmaz/kural-tabanli-siniflandirma
● https://medium.com/@bugcekrpi.1/r-markdown-bi%C3%A7imlendirme-c5fd4eb445ed